Local grammars in word counting
Abstract
The results of word counting in a text depend on the level of its linguistic annotation. If a text is regarded as a sequence of alphabetic character strings, without any information on their possible linguistic interpretation, we speak of a rough text. Some quantitative characteristics of texts can be obtained by applying formal operations to a rough text, but these results can differ significantly from those obtained on an annotated text. Moreover, the results of formal operations on an annotated text change with the level and precision of the annotation. In this paper we consider the conditions under which a rough text can be supplied with the morphosyntactic information that transforms it into a linguistically relevant object. Achieving this goal makes it possible to examine text structure, including its quantitative characteristics. The approach can be illustrated with the following example. If the string mora occurs in a rough Serbian text, it can represent the realization of one of several lexicon elements: mora.N ‘dread’, more.N ‘sea’, and morati.V ‘to have to’. The results of formal operations will depend on how equality among text elements is defined. Namely, at the level of rough text the string mora is a single text element; at the simplest level of grammatical annotation mora.N and mora.V are two different elements; and at the next level mora,mora.N, mora,more.N and mora,morati.V become three different elements. Proceeding further in this manner, we can obtain as many as ten different objects, since mora,mora.N can be the nominative singular or the genitive plural; mora,more.N can be the genitive singular or the nominative, genitive, accusative or vocative plural; and mora,morati.V can be the second person singular aorist or the third person singular present or aorist.
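The counts in the mora example can be checked with a short sketch. It is only an illustration under stated assumptions: the `ANALYSES` table and the level names (`rough`, `pos`, `lemma`, `full`) are hypothetical names introduced here, and the table simply transcribes the readings listed above rather than reproducing the electronic dictionary used in the paper.

```python
# Readings of the surface form "mora", transcribed from the example in the
# abstract. Each entry is (lemma, part of speech, grammatical features).
# The table and the level names below are illustrative assumptions.
ANALYSES = {
    "mora": [
        ("mora", "N", "nominative singular"),
        ("mora", "N", "genitive plural"),
        ("more", "N", "genitive singular"),
        ("more", "N", "nominative plural"),
        ("more", "N", "genitive plural"),
        ("more", "N", "accusative plural"),
        ("more", "N", "vocative plural"),
        ("morati", "V", "second person singular aorist"),
        ("morati", "V", "third person singular present"),
        ("morati", "V", "third person singular aorist"),
    ]
}

LEVELS = ("rough", "pos", "lemma", "full")


def distinct_elements(form, level):
    """Return the set of distinct text elements for `form` at one annotation level."""
    analyses = ANALYSES[form]
    if level == "rough":
        # Only the character string itself is visible.
        return {form}
    if level == "pos":
        # The string plus its part of speech: mora.N, mora.V.
        return {(form, pos) for _, pos, _ in analyses}
    if level == "lemma":
        # The string plus lemma and part of speech: mora,mora.N etc.
        return {(form, lemma, pos) for lemma, pos, _ in analyses}
    if level == "full":
        # The string plus lemma, part of speech and grammatical features.
        return {(form, lemma, pos, feats) for lemma, pos, feats in analyses}
    raise ValueError(f"unknown annotation level: {level}")


counts = {level: len(distinct_elements("mora", level)) for level in LEVELS}
print(counts)
```

The definition of equality among text elements is the only thing that changes between levels, which is why the same string yields one, two, three or ten countable objects.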
The goal of this paper is to investigate to what extent a text can be transformed into a precisely annotated linguistic object by the use of formal methods only. The formal methods used for processing Serbian texts are based on the theory of finite state automata (FSA) and transducers (FST). All lexical data, as well as the text being analyzed, are represented in this form. We will show how FSTs can be used to obtain a linguistically annotated text that is both rich in linguistic information and precise. The limitations of such an approach are not known in advance, although natural languages are essentially finite and can be formally reduced to regular expressions (see Kornai 1999). In Section 2 of this paper we briefly describe the model of the electronic dictionary and its application to Serbian. In Section 3 we describe some problems encountered in the attempt to obtain a precisely annotated text and outline the suggested solutions. Finally, in Section 4 we give some directions for future work.
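How a contextual rule can narrow the readings of an ambiguous form may be sketched as follows. This is not the FST machinery of the paper: the mini-lexicon, the tag `CONJ`, the function names, and the single rule (the form mora immediately followed by da is read as the verb morati) are all assumptions made for illustration, and the rule is applied by a plain left-to-right pass with one token of right context.

```python
# Hypothetical mini-lexicon: surface form -> set of (lemma, part of speech).
# Forms that are not listed receive an empty set of analyses.
LEXICON = {
    "mora": {("mora", "N"), ("more", "N"), ("morati", "V")},
    "da": {("da", "CONJ")},
}


def annotate(tokens):
    """Attach every lexicon analysis to each token (no disambiguation yet)."""
    return [(token, set(LEXICON.get(token, set()))) for token in tokens]


def apply_local_grammar(annotated):
    """Apply one illustrative context rule: 'mora' followed by 'da' is verbal."""
    result = []
    for i, (form, analyses) in enumerate(annotated):
        next_form = annotated[i + 1][0] if i + 1 < len(annotated) else None
        if form == "mora" and next_form == "da":
            verbal = {a for a in analyses if a[1] == "V"}
            # Keep the original readings if the rule would remove all of them.
            if verbal:
                analyses = verbal
        result.append((form, analyses))
    return result


tokens = ["on", "mora", "da", "ide"]
disambiguated = apply_local_grammar(annotate(tokens))
for form, analyses in disambiguated:
    print(form, sorted(analyses))
```

Without a matching context the three readings of mora are all retained, so the number of distinct text elements that a count reports depends on how many such rules apply.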
Publication date: 2007